Analyzing Factors of Failure and Success in Student Education¶

Collaborators: Russell Benjamin (renjamin), Jerry D. Li (jli103)

What is this?¶

As the title suggests, this is a tutorial that explores factors affecting one's ability to continue and succeed in schooling across different age groups. The idea stems from a conversation Jerry and I had in which we realized how vastly different our backgrounds were before we began our higher education at the University of Maryland. While Jerry spent his younger schooling years in China and elsewhere around the world, I was brought up in public and private schools around the DMV area. These different environments gave us unique challenges which have each shaped us in their own way. After talking through our experiences from grade school until now, we became curious about the factors that make a student succeed or fail in their education.

Education is widely known to be important, and many would argue essential. It shapes individuals and provides the ability to contribute meaningfully to society. One of its byproducts is that it enables people from completely different backgrounds to explore common interests, like Jerry and me. This got us wondering: what factors interfere with educational pursuit?

What is this tutorial's outcome?¶

In this tutorial, we'll explore obstacles in education and walk through the entire data science lifecycle from start to finish: data collection, data processing, exploratory data analysis (EDA), model building, and finally interpreting the results and tying them back to our topic question.

Libraries¶

In [52]:
import pandas as pd
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
from IPython.display import Image, display, HTML

# Imports for Step 4
import pydotplus
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score 

Step 1: Data Collection¶

In this step, we will gather and compile interesting data to explore throughout the tutorial.
Source links: (for the csv files, check out this repository https://github.com/jerryliUMD/cmsc320_final_tutorial)
[1]https://ourworldindata.org/child-maltreatment-and-educational-outcomes
[2]https://www.kaggle.com/datasets/jessemostipak/college-tuition-diversity-and-pay
[3]https://nces.ed.gov/programs/coe/indicator/cpb/college-enrollment-rate#:~:text=The%20overall%20college%20enrollment%20rate%20of%2018%2D%20to%2024%2Dyear,%2D%20or%204%2Dyear%20institutions
[4]https://www.kaggle.com/datasets/shariful07/student-mental-health
[5]https://archive-beta.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success
[6]https://research.com/education/college-drug-abuse-statistics

In [53]:
df_children_work = pd.read_csv('working-children-out-of-school-ages-7-14-vs-hours-worked-by-children-ages-7-14.csv')
df_usa_salary_potential = pd.read_csv('usa_salary_potential.csv')
df_usa_college_enroll_rate = pd.read_csv('usa_college_enroll_rate.csv')
df_usa_college_enroll_rate_ethnicity = pd.read_csv('usa_college_enroll_rate_ethnicity.csv')
df_malaysia_student_mental_health = pd.read_csv('malaysia_student_mental_health.csv')
df_drug_abuse_reasons = pd.read_csv('drug_abuse_top_reasons.csv')
df_europe_students = pd.read_csv('europe_college_student_data.csv', sep=';')
In [54]:
#If you want to see the whole table without the '...', you can use the following.
#pd.set_option('display.max_columns', None)  # Show all columns
#pd.set_option('display.max_rows', None)     # Show all rows

#If this is too much data, you can reset back to default (recommended).
#pd.reset_option('display.max_columns')
#pd.reset_option('display.max_rows')

def showTable(df, name):
    # Show a centered title when one is given, then always display the table.
    if name is not None:
        display(HTML(f"<div style='text-align:center;'><h4>{name}</h4></div>"))
    display(df)

The children's working-hours dataset covers children across various countries. Source organization https://ourworldindata.org/

In [55]:
showTable(df_children_work, 'Working children out of school vs. hours worked by children (age 7-14)')

Working children out of school vs. hours worked by children (age 7-14)

Entity Code Year Children in employment, work only (% of children in employment, ages 7-14) Average working hours of children, study and work, ages 7-14 (hours per week) Population (historical estimates) Continent
0 Abkhazia OWID_ABK 2015 NaN NaN NaN Asia
1 Afghanistan AFG 2011 50.0 13.1 29249156.0 NaN
2 Afghanistan AFG -10000 NaN NaN 14737.0 NaN
3 Afghanistan AFG -9000 NaN NaN 20405.0 NaN
4 Afghanistan AFG -8000 NaN NaN 28253.0 NaN
... ... ... ... ... ... ... ...
58142 Zimbabwe ZWE 2017 NaN NaN 14751101.0 NaN
58143 Zimbabwe ZWE 2018 NaN NaN 15052191.0 NaN
58144 Zimbabwe ZWE 2019 NaN NaN 15354606.0 NaN
58145 Zimbabwe ZWE 2020 NaN NaN 15669663.0 NaN
58146 Zimbabwe ZWE 2021 NaN NaN 15993525.0 NaN

58147 rows × 7 columns

Perhaps money is one thing that attracts people to universities: it is one of the motivations that drive students to pursue higher education. Source organization https://www.kaggle.com/

In [56]:
showTable(df_usa_salary_potential, 'Potential Salary for College Graduates (USA)')

Potential Salary for College Graduates (USA)

rank name state_name early_career_pay mid_career_pay make_world_better_percent stem_percent
0 1 Auburn University Alabama 54400 104500 51.0 31
1 2 University of Alabama in Huntsville Alabama 57500 103900 59.0 45
2 3 The University of Alabama Alabama 52300 97400 50.0 15
3 4 Tuskegee University Alabama 54500 93500 61.0 30
4 5 Samford University Alabama 48400 90500 52.0 3
... ... ... ... ... ... ... ...
930 22 Viterbo University Wisconsin 46800 81900 62.0 3
931 23 Concordia University-Wisconsin Wisconsin 46700 81600 61.0 9
932 24 University of Wisconsin-Parkside Wisconsin 46000 81400 47.0 17
933 25 University of Wisconsin-River Falls Wisconsin 47100 81300 52.0 14
934 1 University of Wyoming Wyoming 52400 98800 58.0 25

935 rows × 7 columns

We would like to see college enrollment rates, including rates by ethnicity, because knowing this could help us understand more about students' identities and part of the college environment. Since the United States is a culturally diverse country, student diversity will be an interesting factor to consider. You can explore more details on your own. Source organization https://nces.ed.gov/

In [57]:
showTable(df_usa_college_enroll_rate, 'USA College Enroll Rate 2010 - 2021 for ages 18 to 24')
showTable(df_usa_college_enroll_rate_ethnicity, 'USA College Enroll Rate 2010 - 2021 Ethnicity ages 18 to 24')

USA College Enroll Rate 2010 - 2021 for ages 18 to 24

Year Total Total-Standard Error 2-year 2-year-Standard Error 4-year 4-year-Standard Error Unnamed: 7
0 2010 41.177719 0.571090 12.947770 0.356930 28.229949 0.528590 NaN
1 2011 41.980005 0.594780 12.027942 0.352910 29.952062 0.580250 NaN
2 2012 41.005519 0.618410 12.714605 0.384120 28.290914 0.577060 NaN
3 2013 39.934162 0.627760 11.593688 0.356900 28.340474 0.568490 NaN
4 2014 40.034789 0.651150 10.632735 0.398640 29.402053 0.609860 NaN
5 2015 40.457909 0.703051 10.576509 0.345376 29.881400 0.693661 NaN
6 2016 41.215745 0.706217 10.097104 0.358171 31.118641 0.642209 NaN
7 2017 40.411336 0.662280 10.011453 0.371220 30.399883 0.636670 NaN
8 2018 40.934275 0.677227 9.908324 0.369201 31.025951 0.640221 NaN
9 2019 40.671598 0.813270 10.282093 0.380550 30.389505 0.747883 NaN
10 2020 40.006512 0.699037 9.076408 0.360105 30.930105 0.662079 NaN
11 2021 38.056244 0.652770 8.308609 0.353400 29.747635 0.634653 NaN
12 NOTE: To estimate the margin of error, the sta... NaN NaN NaN NaN NaN NaN NaN
13 SOURCE: U.S. Department of Commerce, Census Bu... NaN NaN NaN NaN NaN NaN NaN

USA College Enroll Rate 2010 - 2021 Ethnicity ages 18 to 24

Race/ethnicity 2010 2010-Standard Error 2021 2021-Standard Error Unnamed: 5
0 Total 41.177719 0.57109 38.056244 0.652770 NaN
1 American Indian/Alaska Native 41.375935 6.59728 28.361443 6.373455 NaN
2 Asian 63.611214 2.69639 60.475638 2.857108 NaN
3 Black 38.440417 1.65739 36.694730 1.853708 NaN
4 Hispanic 31.911062 1.15244 33.434163 1.356797 NaN
5 Pacific Islander 35.963047 8.36290 44.751146 10.297929 NaN
6 White 43.310902 0.81159 38.327076 0.930767 NaN
7 Two or more races 38.262145 4.37737 35.069703 4.045329 NaN
8 NOTE: To estimate the margin of error, the sta... NaN NaN NaN NaN NaN
9 SOURCE: U.S. Department of Commerce, Census Bu... NaN NaN NaN NaN NaN

While students are in school, mental health can be an issue, perhaps because they push themselves hard to get a good job or to achieve something. Stress levels across different courses could be an indicator of this. Here, we use data collected at the International Islamic University Malaysia in 2020 to showcase this. (We are interested in age, course, and mental state.) Source organization https://www.kaggle.com/

In [58]:
showTable(df_malaysia_student_mental_health, 'Student Mental Health in International Islamic University Malaysia in year 2020')

Student Mental Health in International Islamic University Malaysia in year 2020

Timestamp Choose your gender Age What is your course? Your current year of Study What is your CGPA? Marital status Do you have Depression? Do you have Anxiety? Do you have Panic attack? Did you seek any specialist for a treatment?
0 8/7/2020 12:02 Female 18.0 Engineering year 1 3.00 - 3.49 No Yes No Yes No
1 8/7/2020 12:04 Male 21.0 Islamic education year 2 3.00 - 3.49 No No Yes No No
2 8/7/2020 12:05 Male 19.0 BIT Year 1 3.00 - 3.49 No Yes Yes Yes No
3 8/7/2020 12:06 Female 22.0 Laws year 3 3.00 - 3.49 Yes Yes No No No
4 8/7/2020 12:13 Male 23.0 Mathemathics year 4 3.00 - 3.49 No No No No No
... ... ... ... ... ... ... ... ... ... ... ...
96 13/07/2020 19:56:49 Female 21.0 BCS year 1 3.50 - 4.00 No No Yes No No
97 13/07/2020 21:21:42 Male 18.0 Engineering Year 2 3.00 - 3.49 No Yes Yes No No
98 13/07/2020 21:22:56 Female 19.0 Nursing Year 3 3.50 - 4.00 Yes Yes No Yes No
99 13/07/2020 21:23:57 Female 23.0 Pendidikan Islam year 4 3.50 - 4.00 No No No No No
100 18/07/2020 20:16:21 Male 20.0 Biomedical science Year 2 3.00 - 3.49 No No No No No

101 rows × 11 columns

Drug abuse is often thought of as a direct cause of poor educational performance. Because it is an obstacle usually taken as 'common sense', we would like to look into it. Here is a student dataset from the University of Sao Paulo. Source organization https://research.com/

In [59]:
showTable(df_drug_abuse_reasons, 'Common Reasons for Drug Abuse')

Common Reasons for Drug Abuse

Category Top Reasons For Drug Abuse Among College Students
0 Influence by peers 96.6%
1 Curiosity 93.3%
2 Search for fun 93.3%
3 School-related stress 86.6%
4 Living away from family 80%
5 Media influence 56.6%

We also found a dataset with information on European college students. Source organization https://archive-beta.ics.uci.edu/

In [60]:
showTable(df_europe_students, 'Europe College Student Info')

Europe College Student Info

Marital status Application mode Application order Course Daytime/evening attendance Previous qualification Previous qualification (grade) Nacionality Mother's qualification Father's qualification ... Curricular units 2nd sem (credited) Curricular units 2nd sem (enrolled) Curricular units 2nd sem (evaluations) Curricular units 2nd sem (approved) Curricular units 2nd sem (grade) Curricular units 2nd sem (without evaluations) Unemployment rate Inflation rate GDP Target
0 1 17 5 171 1 1 122.0 1 19 12 ... 0 0 0 0 0.000000 0 10.8 1.4 1.74 Dropout
1 1 15 1 9254 1 1 160.0 1 1 3 ... 0 6 6 6 13.666667 0 13.9 -0.3 0.79 Graduate
2 1 1 5 9070 1 1 122.0 1 37 37 ... 0 6 0 0 0.000000 0 10.8 1.4 1.74 Dropout
3 1 17 2 9773 1 1 122.0 1 38 37 ... 0 6 10 5 12.400000 0 9.4 -0.8 -3.12 Graduate
4 2 39 1 8014 0 1 100.0 1 37 38 ... 0 6 6 6 13.000000 0 13.9 -0.3 0.79 Graduate
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4419 1 1 6 9773 1 1 125.0 1 1 1 ... 0 6 8 5 12.666667 0 15.5 2.8 -4.06 Graduate
4420 1 1 2 9773 1 1 120.0 105 1 1 ... 0 6 6 2 11.000000 0 11.1 0.6 2.02 Dropout
4421 1 1 1 9500 1 1 154.0 1 37 37 ... 0 8 9 1 13.500000 0 13.9 -0.3 0.79 Dropout
4422 1 1 1 9147 1 1 180.0 1 37 37 ... 0 5 6 5 12.000000 0 9.4 -0.8 -3.12 Graduate
4423 1 10 1 9773 1 1 152.0 22 38 37 ... 0 6 6 6 13.000000 0 12.7 3.7 -1.70 Graduate

4424 rows × 37 columns

Still alive? Keep reading.

Step 2: Data Processing¶

Much of the time, the data we get is messy. In this step, we will clean the data: check for NaN values and duplicate rows first, then drop unnecessary columns and rename columns where needed.
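Before applying these steps to the real datasets below, here is the same checklist sketched on a tiny, made-up frame (the column names and values here are hypothetical, purely for illustration):

```python
import pandas as pd

# Hypothetical messy frame: a NaN-heavy column with an unwieldy name,
# plus one fully duplicated row.
df = pd.DataFrame({
    'Entity': ['A', 'B', 'B', 'C'],
    'Some very long column name (units)': [1.0, None, None, 3.0],
})

print(df.isna().sum())        # NaN count per column
print(df.duplicated().sum())  # number of duplicated rows (NaNs compare equal here)

# Rename the long column, then drop rows missing the value we care about.
df = df.rename(columns={'Some very long column name (units)': 'Value'})
df = df.dropna(subset=['Value'])
print(df['Value'].tolist())   # [1.0, 3.0]
```

The real cells below follow exactly this pattern, dataset by dataset.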

In [61]:
def peekData(df):
    print('---------------------------------------------------')
    print('Table NaN values count for each column\n') 
    print(df.isna().sum(), '\n')
    print('Table duplicated rows count', df.duplicated().sum(), '\n')
    print(df.info())
    print('---------------------------------------------------')
In [62]:
peekData(df_children_work)
---------------------------------------------------
Table NaN values count for each column

Entity                                                                               0
Code                                                                              3281
Year                                                                                 0
Children in employment, work only (% of children in employment, ages 7-14)       57865
Average working hours of children, study and work, ages 7-14 (hours per week)    58015
Population (historical estimates)                                                   48
Continent                                                                        57862
dtype: int64 

Table duplicated rows count 0 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58147 entries, 0 to 58146
Data columns (total 7 columns):
 #   Column                                                                         Non-Null Count  Dtype  
---  ------                                                                         --------------  -----  
 0   Entity                                                                         58147 non-null  object 
 1   Code                                                                           54866 non-null  object 
 2   Year                                                                           58147 non-null  int64  
 3   Children in employment, work only (% of children in employment, ages 7-14)     282 non-null    float64
 4   Average working hours of children, study and work, ages 7-14 (hours per week)  132 non-null    float64
 5   Population (historical estimates)                                              58099 non-null  float64
 6   Continent                                                                      285 non-null    object 
dtypes: float64(3), int64(1), object(3)
memory usage: 3.1+ MB
None
---------------------------------------------------

For the children work dataset, there are quite a lot of NaN values, so something like imputation won't work. In this case, we are interested in entity (country), year, population, and child work hours, so we can drop the rows that have NaN values in these columns and see what is left. Also, the column names are too long, so we will shorten them.

In [63]:
#Don't need the Continent column, drop it. (We keep Code as a handy country identifier.)
df_children_work.drop(columns=['Continent'], inplace=True)

df_children_work.rename(
    columns={'Entity': 'Country',
             'Children in employment, work only (% of children in employment, ages 7-14)': 'Work_only_pct', 
             'Average working hours of children, study and work, ages 7-14 (hours per week)': 'Work_avg',
             'Population (historical estimates)': 'Population'}, 
    inplace=True
)

#Keep only the years we care about. (The filter must be assigned back, or it is silently discarded.)
df_children_work = df_children_work[(df_children_work['Year'] >= 1999) & (df_children_work['Year'] <= 2016)]

df_children_work = df_children_work.dropna(subset=['Work_only_pct', 'Work_avg', 'Population'])

#if you want to double check, uncomment this line.
#peekData(df_children_work) 

df_children_work.head()
Out[63]:
Country Code Year Work_only_pct Work_avg Population
1 Afghanistan AFG 2011 50.000000 13.1 29249156.0
598 Albania ALB 2010 5.100000 13.6 2913402.0
855 Algeria DZA 2013 4.508655 3.6 38000628.0
1487 Angola AGO 2001 26.600000 12.5 16941584.0
2389 Armenia ARM 2010 0.000000 5.4 2946296.0
In [64]:
peekData(df_usa_salary_potential)
---------------------------------------------------
Table NaN values count for each column

rank                          0
name                          0
state_name                    0
early_career_pay              0
mid_career_pay                0
make_world_better_percent    33
stem_percent                  0
dtype: int64 

Table duplicated rows count 1 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 935 entries, 0 to 934
Data columns (total 7 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   rank                       935 non-null    int64  
 1   name                       935 non-null    object 
 2   state_name                 935 non-null    object 
 3   early_career_pay           935 non-null    int64  
 4   mid_career_pay             935 non-null    int64  
 5   make_world_better_percent  902 non-null    float64
 6   stem_percent               935 non-null    int64  
dtypes: float64(1), int64(4), object(2)
memory usage: 51.3+ KB
None
---------------------------------------------------

For the USA college graduate potential salary, we care about the pay, so we will drop the irrelevant column "make_world_better_percent".

In [65]:
df_usa_salary_potential = df_usa_salary_potential.drop(columns=['make_world_better_percent'])
df_usa_salary_potential = df_usa_salary_potential.drop_duplicates(keep='last') #keep only one of the duplicated rows.
df_usa_salary_potential.head()
Out[65]:
rank name state_name early_career_pay mid_career_pay stem_percent
0 1 Auburn University Alabama 54400 104500 31
1 2 University of Alabama in Huntsville Alabama 57500 103900 45
2 3 The University of Alabama Alabama 52300 97400 15
3 4 Tuskegee University Alabama 54500 93500 30
4 5 Samford University Alabama 48400 90500 3
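As a quick aside on `drop_duplicates(keep='last')` above: the `keep` argument only decides which of the identical copies survives; either way exactly one copy remains. A toy frame (hypothetical values) makes this concrete:

```python
import pandas as pd

# Hypothetical frame with one fully duplicated row ('Y', 2).
df = pd.DataFrame({'name': ['X', 'Y', 'Y'], 'pay': [1, 2, 2]})

# keep='first' (the default) keeps the earlier copy at index 1;
# keep='last' keeps the later copy at index 2. Row counts match either way.
deduped = df.drop_duplicates(keep='last')
print(list(deduped.index))  # [0, 2]
print(len(deduped))         # 2
```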
In [66]:
peekData(df_usa_college_enroll_rate)
peekData(df_usa_college_enroll_rate_ethnicity)
---------------------------------------------------
Table NaN values count for each column

Year                      0
Total                     2
Total-Standard Error      2
2-year                    2
2-year-Standard Error     2
4-year                    2
4-year-Standard Error     2
Unnamed: 7               14
dtype: int64 

Table duplicated rows count 0 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Year                   14 non-null     object 
 1   Total                  12 non-null     float64
 2   Total-Standard Error   12 non-null     float64
 3   2-year                 12 non-null     float64
 4   2-year-Standard Error  12 non-null     float64
 5   4-year                 12 non-null     float64
 6   4-year-Standard Error  12 non-null     float64
 7   Unnamed: 7             0 non-null      float64
dtypes: float64(7), object(1)
memory usage: 1.0+ KB
None
---------------------------------------------------
---------------------------------------------------
Table NaN values count for each column

Race/ethnicity          0
2010                    2
2010-Standard Error     2
2021                    2
2021-Standard Error     2
Unnamed: 5             10
dtype: int64 

Table duplicated rows count 0 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Race/ethnicity       10 non-null     object 
 1   2010                 8 non-null      float64
 2   2010-Standard Error  8 non-null      float64
 3   2021                 8 non-null      float64
 4   2021-Standard Error  8 non-null      float64
 5   Unnamed: 5           0 non-null      float64
dtypes: float64(5), object(1)
memory usage: 612.0+ bytes
None
---------------------------------------------------

From the US college enrollment and ethnicity tables, we will get rid of the NaN values by dropping the last 2 rows and the trailing unnamed column (we can see why from Step 1: Data Collection). The column names look clean and the overall tables are small. Easy to process :)

In [67]:
df_usa_college_enroll_rate_enthnicity = df_usa_college_enroll_rate_ethnicity.copy()
df_usa_college_enroll_rate = df_usa_college_enroll_rate.iloc[:-2, :-1] #get rid of last 2 rows and last column
df_usa_college_enroll_rate_enthnicity = df_usa_college_enroll_rate_enthnicity.iloc[:-2,:-1]

df_usa_college_enroll_rate_enthnicity.rename(
    columns = {
        'Race/ethnicity': 'Race',
        '2010-Standard Error': '2010-std-error',
        '2021-Standard Error': '2021-std-error'
    },
    inplace = True
)

#uncomment to double check
display(df_usa_college_enroll_rate.head())
display(df_usa_college_enroll_rate_enthnicity.head())
Year Total Total-Standard Error 2-year 2-year-Standard Error 4-year 4-year-Standard Error
0 2010 41.177719 0.57109 12.947770 0.35693 28.229949 0.52859
1 2011 41.980005 0.59478 12.027942 0.35291 29.952062 0.58025
2 2012 41.005519 0.61841 12.714605 0.38412 28.290914 0.57706
3 2013 39.934162 0.62776 11.593688 0.35690 28.340474 0.56849
4 2014 40.034789 0.65115 10.632735 0.39864 29.402053 0.60986
Race 2010 2010-std-error 2021 2021-std-error
0 Total 41.177719 0.57109 38.056244 0.652770
1 American Indian/Alaska Native 41.375935 6.59728 28.361443 6.373455
2 Asian 63.611214 2.69639 60.475638 2.857108
3 Black 38.440417 1.65739 36.694730 1.853708
4 Hispanic 31.911062 1.15244 33.434163 1.356797
In [68]:
peekData(df_malaysia_student_mental_health)
---------------------------------------------------
Table NaN values count for each column

Timestamp                                       0
Choose your gender                              0
Age                                             1
What is your course?                            0
Your current year of Study                      0
What is your CGPA?                              0
Marital status                                  0
Do you have Depression?                         0
Do you have Anxiety?                            0
Do you have Panic attack?                       0
Did you seek any specialist for a treatment?    0
dtype: int64 

Table duplicated rows count 0 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 11 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   Timestamp                                     101 non-null    object 
 1   Choose your gender                            101 non-null    object 
 2   Age                                           100 non-null    float64
 3   What is your course?                          101 non-null    object 
 4   Your current year of Study                    101 non-null    object 
 5   What is your CGPA?                            101 non-null    object 
 6   Marital status                                101 non-null    object 
 7   Do you have Depression?                       101 non-null    object 
 8   Do you have Anxiety?                          101 non-null    object 
 9   Do you have Panic attack?                     101 non-null    object 
 10  Did you seek any specialist for a treatment?  101 non-null    object 
dtypes: float64(1), object(10)
memory usage: 8.8+ KB
None
---------------------------------------------------

In the student mental health dataset, there is only 1 missing value in the age column, which is probably missing at random, so we can fill it with the mean age. Also, we will change the column names. The timestamp does not interest us because it covers only 2-3 months of 2020 (not a wide enough window), so we drop it. We will focus on the survey question answers.

In [69]:
#Filling the missing age with the (rounded) mean; assigning back avoids pandas' chained-assignment pitfalls.
df_malaysia_student_mental_health["Age"] = df_malaysia_student_mental_health["Age"].fillna(round(df_malaysia_student_mental_health["Age"].mean()))

#get rid of not interested columns
df_malaysia_student_mental_health.drop(columns=['Timestamp', 'Choose your gender', 
                                                'Marital status', 'Your current year of Study'], inplace=True)

#rename long column names
df_malaysia_student_mental_health.rename(
    columns = {
        'What is your course?': 'Course', 
        'What is your CGPA?': 'CGPA', 
        'Do you have Depression?': 'Depression', 
        'Do you have Anxiety?': 'Anxiety',
        'Do you have Panic attack?': 'Panic_attack', 
        'Did you seek any specialist for a treatment?': 'Treatment'}, 
    inplace = True
)

# Convert "yes" and "no" to 1 and 0
df_malaysia_student_mental_health['Depression'] = df_malaysia_student_mental_health['Depression'].apply(lambda x: 1 if x == 'Yes' else 0)
df_malaysia_student_mental_health['Anxiety'] = df_malaysia_student_mental_health['Anxiety'].apply(lambda x: 1 if x == 'Yes' else 0)
df_malaysia_student_mental_health['Panic_attack'] = df_malaysia_student_mental_health['Panic_attack'].apply(lambda x: 1 if x == 'Yes' else 0)
df_malaysia_student_mental_health['Treatment'] = df_malaysia_student_mental_health['Treatment'].apply(lambda x: 1 if x == 'Yes' else 0)

df_malaysia_student_mental_health['CGPA'] = df_malaysia_student_mental_health['CGPA'].str.replace(' ', '').str.replace('-', ' - ')
df_malaysia_student_mental_health['Course'] = df_malaysia_student_mental_health['Course'].str.lower().str.replace(' ', '')

df_malaysia_student_mental_health.head()
Out[69]:
Age Course CGPA Depression Anxiety Panic_attack Treatment
0 18.0 engineering 3.00 - 3.49 1 0 1 0
1 21.0 islamiceducation 3.00 - 3.49 0 1 0 0
2 19.0 bit 3.00 - 3.49 1 1 1 0
3 22.0 laws 3.00 - 3.49 1 0 0 0
4 23.0 mathemathics 3.00 - 3.49 0 0 0 0
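A note on the yes/no coding above: `.map` with an explicit dictionary is an equivalent alternative to the `apply`/lambda approach. One practical difference is that `.map` leaves unexpected answers as NaN instead of silently coding them as 0, which makes data-entry surprises easier to spot. A toy series (hypothetical values) shows the idea:

```python
import pandas as pd

# Hypothetical survey answers, including one unexpected value.
s = pd.Series(['Yes', 'No', 'Yes', 'maybe'])

coded = s.map({'Yes': 1, 'No': 0})
print(coded.tolist())  # [1.0, 0.0, 1.0, nan] -- 'maybe' surfaces as NaN
```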

From the drug abuse table, we will rename the features for clearer interpretation. We will also strip the % signs from the percent values for easier plotting later on.

In [70]:
peekData(df_drug_abuse_reasons)
---------------------------------------------------
Table NaN values count for each column

Category                                             0
Top Reasons For Drug Abuse Among College Students    0
dtype: int64 

Table duplicated rows count 0 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #   Column                                             Non-Null Count  Dtype 
---  ------                                             --------------  ----- 
 0   Category                                           6 non-null      object
 1   Top Reasons For Drug Abuse Among College Students  6 non-null      object
dtypes: object(2)
memory usage: 228.0+ bytes
None
---------------------------------------------------
In [71]:
df_drug_abuse_reasons.rename(columns={'Category' : 'Reason'}, inplace=True)
df_drug_abuse_reasons.rename(columns={'Top Reasons For Drug Abuse Among College Students' : 'Percent'}, inplace=True)

df_drug_abuse_reasons['Percent'] = df_drug_abuse_reasons['Percent'].str.replace('%', '').astype(float)
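The %-stripping step can be sanity-checked on a toy series (hypothetical values; passing `regex=False` makes the literal-replacement intent explicit):

```python
import pandas as pd

# Hypothetical percent strings in the same format as the drug-abuse table.
s = pd.Series(['96.6%', '80%'])

nums = s.str.replace('%', '', regex=False).astype(float)
print(nums.tolist())  # [96.6, 80.0]
```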

For the Europe student dataset, there are a total of 36 features available to us (plus the Target label). We will drop a number of these to home in on the features of interest for later visualizations.

In [72]:
peekData(df_europe_students)
---------------------------------------------------
Table NaN values count for each column

Marital status                                    0
Application mode                                  0
Application order                                 0
Course                                            0
Daytime/evening attendance                        0
Previous qualification                            0
Previous qualification (grade)                    0
Nacionality                                       0
Mother's qualification                            0
Father's qualification                            0
Mother's occupation                               0
Father's occupation                               0
Admission grade                                   0
Displaced                                         0
Educational special needs                         0
Debtor                                            0
Tuition fees up to date                           0
Gender                                            0
Scholarship holder                                0
Age at enrollment                                 0
International                                     0
Curricular units 1st sem (credited)               0
Curricular units 1st sem (enrolled)               0
Curricular units 1st sem (evaluations)            0
Curricular units 1st sem (approved)               0
Curricular units 1st sem (grade)                  0
Curricular units 1st sem (without evaluations)    0
Curricular units 2nd sem (credited)               0
Curricular units 2nd sem (enrolled)               0
Curricular units 2nd sem (evaluations)            0
Curricular units 2nd sem (approved)               0
Curricular units 2nd sem (grade)                  0
Curricular units 2nd sem (without evaluations)    0
Unemployment rate                                 0
Inflation rate                                    0
GDP                                               0
Target                                            0
dtype: int64 

Table duplicated rows count 0 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 37 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   Marital status                                  4424 non-null   int64  
 1   Application mode                                4424 non-null   int64  
 2   Application order                               4424 non-null   int64  
 3   Course                                          4424 non-null   int64  
 4   Daytime/evening attendance                      4424 non-null   int64  
 5   Previous qualification                          4424 non-null   int64  
 6   Previous qualification (grade)                  4424 non-null   float64
 7   Nacionality                                     4424 non-null   int64  
 8   Mother's qualification                          4424 non-null   int64  
 9   Father's qualification                          4424 non-null   int64  
 10  Mother's occupation                             4424 non-null   int64  
 11  Father's occupation                             4424 non-null   int64  
 12  Admission grade                                 4424 non-null   float64
 13  Displaced                                       4424 non-null   int64  
 14  Educational special needs                       4424 non-null   int64  
 15  Debtor                                          4424 non-null   int64  
 16  Tuition fees up to date                         4424 non-null   int64  
 17  Gender                                          4424 non-null   int64  
 18  Scholarship holder                              4424 non-null   int64  
 19  Age at enrollment                               4424 non-null   int64  
 20  International                                   4424 non-null   int64  
 21  Curricular units 1st sem (credited)             4424 non-null   int64  
 22  Curricular units 1st sem (enrolled)             4424 non-null   int64  
 23  Curricular units 1st sem (evaluations)          4424 non-null   int64  
 24  Curricular units 1st sem (approved)             4424 non-null   int64  
 25  Curricular units 1st sem (grade)                4424 non-null   float64
 26  Curricular units 1st sem (without evaluations)  4424 non-null   int64  
 27  Curricular units 2nd sem (credited)             4424 non-null   int64  
 28  Curricular units 2nd sem (enrolled)             4424 non-null   int64  
 29  Curricular units 2nd sem (evaluations)          4424 non-null   int64  
 30  Curricular units 2nd sem (approved)             4424 non-null   int64  
 31  Curricular units 2nd sem (grade)                4424 non-null   float64
 32  Curricular units 2nd sem (without evaluations)  4424 non-null   int64  
 33  Unemployment rate                               4424 non-null   float64
 34  Inflation rate                                  4424 non-null   float64
 35  GDP                                             4424 non-null   float64
 36  Target                                          4424 non-null   object 
dtypes: float64(7), int64(29), object(1)
memory usage: 1.2+ MB
None
---------------------------------------------------
In [73]:
kept_features = ['Marital status', 'Application mode', 'Application order', 'Course', 'Daytime/evening attendance',
            'Previous qualification', 'Nacionality', "Mother's qualification", "Father's qualification", "Mother's occupation",
            "Father's occupation", 'Displaced', 'Educational special needs', 'Debtor',
            'Tuition fees up to date', 'Gender', 'Scholarship holder', 'Age at enrollment', 'International',
            'Unemployment rate', 'Inflation rate', 'GDP', 'Curricular units 1st sem (approved)',
            'Curricular units 1st sem (grade)', 'Curricular units 2nd sem (approved)',
            'Curricular units 2nd sem (grade)','Target']

df_europe_students = df_europe_students[kept_features]

df_europe_students.head()
Out[73]:
Marital status Application mode Application order Course Daytime/evening attendance Previous qualification Nacionality Mother's qualification Father's qualification Mother's occupation ... Age at enrollment International Unemployment rate Inflation rate GDP Curricular units 1st sem (approved) Curricular units 1st sem (grade) Curricular units 2nd sem (approved) Curricular units 2nd sem (grade) Target
0 1 17 5 171 1 1 1 19 12 5 ... 20 0 10.8 1.4 1.74 0 0.000000 0 0.000000 Dropout
1 1 15 1 9254 1 1 1 1 3 3 ... 19 0 13.9 -0.3 0.79 6 14.000000 6 13.666667 Graduate
2 1 1 5 9070 1 1 1 37 37 9 ... 19 0 10.8 1.4 1.74 0 0.000000 0 0.000000 Dropout
3 1 17 2 9773 1 1 1 38 37 5 ... 20 0 9.4 -0.8 -3.12 6 13.428571 5 12.400000 Graduate
4 2 39 1 8014 0 1 1 37 38 9 ... 45 0 13.9 -0.3 0.79 5 12.333333 6 13.000000 Graduate

5 rows × 27 columns

Step 3: Exploratory Data Analysis (EDA) & Visualization¶

Children Labor Hours dataset EDA¶

The children's labor data isn't ideal for visualization, but let's first get an overall view and then build a scatter plot.

In [74]:
df_children_work = df_children_work.sort_values(by='Year')

#Uncomment these to see details
#display(df_children_work)
years = df_children_work['Year'].unique()
display(df_children_work[['Work_only_pct', 'Work_avg', 'Population']].describe())
Work_only_pct Work_avg Population
count 132.000000 132.00000 1.320000e+02
mean 21.194017 13.61419 3.247172e+07
std 18.363822 7.41143 5.065799e+07
min 0.000000 1.90000 1.019369e+06
25% 6.539222 8.37500 5.986888e+06
50% 17.528625 13.05000 1.237385e+07
75% 32.422263 16.50000 2.939637e+07
max 89.345680 40.30000 2.440162e+08

Let's do a visualization. We could try a 3D scatter plot, but a 2D plot will be easier to read. First, we normalize the population. Although we care most about children's work hours, it is also interesting to show the population at the same time. In seaborn, passing sizes=(20, 200) effectively normalizes the marker sizes. Why normalize rather than standardize? Normalization preserves the relative proportions of the population, which is what we want to emphasize; standardization is better when you want to emphasize the distribution of values.
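To make the normalize-versus-standardize distinction concrete, here is a minimal sketch; the population values are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Toy population values (hypothetical, for illustration only)
pop = pd.Series([1.0e6, 6.0e6, 1.2e7, 2.9e7, 2.4e8])

# Min-max normalization: rescales to [0, 1], preserving relative proportions
normalized = (pop - pop.min()) / (pop.max() - pop.min())

# Z-score standardization: centers on the mean, in units of standard deviation
standardized = (pop - pop.mean()) / pop.std()

print(normalized.round(3).tolist())
print(standardized.round(3).tolist())
```

Normalization keeps the largest country at exactly 1 and the smallest at 0, which is why it suits proportional marker sizes; standardized values instead describe how far each country sits from the mean.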

In [75]:
# Filter out the specific warning
warnings.filterwarnings("ignore", category=UserWarning, message="The figure layout has changed to tight")

# Create a FacetGrid of scatter plots
g = sns.FacetGrid(df_children_work, col='Year', col_wrap=5, height=3, sharex=False, sharey=False)
g.map_dataframe(
    sns.scatterplot,
    x='Work_avg',
    y='Work_only_pct',
    size='Population',
    hue='Country',
    sizes=(20, 200),
    alpha=0.7,
)

g.set_titles(col_template="Year {col_name}")
g.set_axis_labels('Work Average (hr/week)', 'Work Only Percentage')
g.fig.suptitle('Scatter Plots: Work vs. Work Only Percentage by Year for Children age 7-14', y=1.02)
plt.subplots_adjust(top=0.9, hspace=0.5) 

#let's track some countries
track_countries = ['Nigeria', 'Gambia', 'Bangladesh', 'Kenya', 'Bolivia']

# Iterate through subplots and add annotations
for ax in g.axes.flat:
    year = ax.get_title().split(' ')[-1]
    data_year = df_children_work[df_children_work['Year'] == int(year)]
    for _, row in data_year.iterrows():
        if row['Country'] in track_countries:
            annotation_text = f"{row['Code']}"  # Combine country and year
            ax.annotate(annotation_text, (row['Work_avg'], row['Work_only_pct']),
                        textcoords="offset points", xytext=(0, 2), ha='center')

plt.show()
[Figure: faceted scatter plots of work average vs. work-only percentage by year, ages 7-14]

Each country gets its own color, so the plot is full of colorful "bubbles" whose sizes reflect population. Note that not every country has child labor data for every year, so the overall trend is hard to interpret. Still, places like KEN, BGD, and GMB show some interesting movement.

For KEN, in 2000 the work average is about 9.5 hours and the work-only percentage is around 14 percent; by 2009, the work average jumped to 32 hours and the work-only percentage to about 38 percent.

For BGD, in 2006 the work average is 10 hours and the work-only percentage is 40; by 2013, the work average increased to 30 hours and the work-only percentage reached 60!

For GMB, in 2008 the work average is about 13 hours and the work-only percentage is 26; by 2015, the work average decreased to roughly 10.5 hours per week, yet the work-only percentage exceeded 40 percent!

We show this data because children working between ages 7 and 14 could indicate that their families are financially unstable, which may lead the children to value earning money over education. For families trapped in generational poverty under some political systems, this may feel inevitable, carrying inequality arguments with it. Watching the work-only percentage rise and fall, we start to wonder whether people can see the value in education.

USA College Potential Salaries by State dataset EDA¶

Money is one obvious motivation for pursuing education. We found a dataset showing the potential salaries of graduates from some USA colleges. Let's go.

In [76]:
# We are more interested in the bigger picture, such as the State Data in the USA.
# We will extract some USA states we thought could be interesting to see since the education quality is well known.

targets = ['Maryland', 'Colorado',  'Massachusetts', 'Virginia', 'New-York']

# Calculate mean values for each state
mean_values = df_usa_salary_potential.groupby('state_name')[['early_career_pay', 'mid_career_pay', 'stem_percent']].mean().reset_index()

# Let's see some colorful plots
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(30, 8))


bar_colors = ['skyblue' if state not in targets else 'darkblue' for state in mean_values['state_name']]
axes[0].bar(mean_values.index, mean_values['early_career_pay'], color=bar_colors)
axes[0].set_title('Mean Early Career Pay')
axes[0].set_ylabel('Salary')
axes[0].set_xticks(mean_values.index)
axes[0].set_xticklabels(mean_values['state_name'], rotation=90, ha='right')

bar_colors = ['lightgreen' if state not in targets else 'darkgreen' for state in mean_values['state_name']]
axes[1].bar(mean_values.index, mean_values['mid_career_pay'], color=bar_colors)
axes[1].set_title('Mean Mid Career Pay')
axes[1].set_ylabel('Salary')
axes[1].set_xticks(mean_values.index)
axes[1].set_xticklabels(mean_values['state_name'], rotation=90, ha='right')

bar_colors = ['lightcoral' if state not in targets else 'darkred' for state in mean_values['state_name']]
axes[2].bar(mean_values.index, mean_values['stem_percent'], color=bar_colors)
axes[2].set_title('Mean STEM %')
axes[2].set_ylabel('Percent')
axes[2].set_xticks(mean_values.index)
axes[2].set_xticklabels(mean_values['state_name'], rotation=90, ha='right')

# Show plot
plt.show()
[Figure: mean early career pay, mid career pay, and STEM % by state, target states highlighted]

The mean early-career salary for fresh college graduates mostly falls in the range [40000, 55000]. Interestingly, California has the highest average potential pay overall, yet a lower STEM percentage than New-York. Looking at our target states (the dark highlighted bars), mid-career pay for Maryland and Virginia colleges is about the same (a little under 100k), while Maryland edges out Virginia on STEM percentage. Massachusetts ranks high across all three plots [~80500, ~115000, ~27 pct]. Colorado's salaries are relatively lower than the other states'. While some states look attractive to us, we encourage you to explore this financial aspect further; it could be insightful. It seems going to college can lead to decent money, since we can learn things there we couldn't on our own. To extend this idea, let's look at college diversity.

USA College Enrollment rate & Ethnicity Enrollment Rate dataset EDA¶

We will look at the USA college enrollment rate data to see how college enrollment could reflect on something.

In [77]:
# Set up the plot
plt.figure(figsize=(15, 6))

# Load custom marker images
img_2_year = plt.imread('./marker1.png')
img_4_year = plt.imread('./marker1.png')
img_total = plt.imread('./marker2.png')

# Plot the lines
plt.errorbar(df_usa_college_enroll_rate['Year'], df_usa_college_enroll_rate['2-year'], 
             yerr=df_usa_college_enroll_rate['2-year-Standard Error'], 
             marker='', color='skyblue', linewidth=2.0, label='2-year')

plt.errorbar(df_usa_college_enroll_rate['Year'], df_usa_college_enroll_rate['4-year'], 
             yerr=df_usa_college_enroll_rate['4-year-Standard Error'], 
             marker='', color='lightgreen', linewidth=2.0, label='4-year')

plt.errorbar(df_usa_college_enroll_rate['Year'], df_usa_college_enroll_rate['Total'], 
             yerr=df_usa_college_enroll_rate['Total-Standard Error'], 
             marker='', color='orange', linewidth=2.0, label='Total Enroll')

# Add custom markers for 2-year and 4-year using AnnotationBbox
for year, y2, y4, total in zip(df_usa_college_enroll_rate['Year'], df_usa_college_enroll_rate['2-year'], 
                               df_usa_college_enroll_rate['4-year'], df_usa_college_enroll_rate['Total']):
    ab_2_year = AnnotationBbox(OffsetImage(img_2_year, zoom=0.035), (year, y2 + 2.8), frameon=False)
    plt.gca().add_artist(ab_2_year)
    ab_4_year = AnnotationBbox(OffsetImage(img_4_year, zoom=0.035), (year, y4 + 2.8), frameon=False)
    plt.gca().add_artist(ab_4_year)
    ab_total = AnnotationBbox(OffsetImage(img_total, zoom=0.035), (year, total + 2.8), frameon=False)
    plt.gca().add_artist(ab_total)

# Customize the plot
plt.xlabel('Year')
plt.ylabel('Enrollment Rate (%)')
plt.title('The overall USA college enrollment rate for 18- to 24-year-olds', y=1.05)
plt.legend()

plt.ylim(min(df_usa_college_enroll_rate['2-year']) - 1, max(df_usa_college_enroll_rate['4-year']) + 30)
plt.grid(False)
plt.tight_layout()
plt.show()
[Figure: overall USA college enrollment rate for 18- to 24-year-olds, 2-year vs 4-year vs total]

From 2010 to 2021, 2-year college enrollment dropped from roughly 14 percent to below 10 percent, while 4-year enrollment hovered around the 30 percent line. Overall enrollment appears to decline from 2018 to 2021. To our surprise, college looks almost like a privilege in the US: over this whole decade, the enrollment rate never exceeds 50 percent! Perhaps cost is driving young people away, even though potential early- and mid-career salaries are decent. We couldn't find the financial piece in this same dataset. You see, if we were data scientists in the wild, it would always be a good idea to first figure out what kinds of data we want, so that we have strong supporting evidence. So perhaps the salary motivation has less effect here than we assumed.

We've looked at overall enrollment rates. A natural next step is to explore the diversity of the data, such as the ethnicity of the enrolled population, since the USA is a very diverse country (a melting pot). Thus, we present to you:

In [78]:
# Set up the plot
plt.figure(figsize=(15, 5))

# Calculate the width of each bar group
bar_width = 0.35

# Create positions for the bars
positions_2010 = list(range(len(df_usa_college_enroll_rate_enthnicity)))
positions_2021 = [pos + bar_width for pos in positions_2010]

# Plot the bars for 2010 and 2021
plt.barh(positions_2010, df_usa_college_enroll_rate_enthnicity['2010'], 
         height=bar_width, color='skyblue', alpha=0.8, label='2010')

plt.barh(positions_2021, df_usa_college_enroll_rate_enthnicity['2021'], 
         height=bar_width, color='coral', alpha=0.8, label='2021')

# Set y-ticks and labels
plt.yticks([pos + bar_width / 2 for pos in positions_2010], df_usa_college_enroll_rate_enthnicity['Race'])

# Customize the plot
plt.xlabel('Enrollment Rate (%)')
plt.ylabel('Race/Ethnicity')
plt.title('USA College Enrollment Rates by Race/Ethnicity: 2010 vs 2021', y=1.05)
plt.legend()
plt.grid(False)
# Display the plot
plt.tight_layout()
plt.show()
[Figure: USA college enrollment rates by race/ethnicity, 2010 vs 2021]

From the same 2010-2021 enrollment dataset for ages 18-24, we see that Asian students have the highest enrollment rate throughout (60+ percent), followed by White and Pacific Islander students (~40 percent on average); Black enrollment averages just under 40 percent. Perhaps Asian students are more prone to a "high college grades equal success" mindset? Even if so, we doubt that alone makes a student successful in education, because focusing solely on grades is really stressful. Whether people view college as a resource, a place to show intelligence, a factory producing high-quality labor, or a place for self-improvement, these chaotic thoughts get transformed and "temporarily ordered" by schools; that is, you learn how things work, what is better or worse, what other people are thinking, and so on.

Malaysia College Student Mental Health dataset EDA¶

Here, we will look at the mental-health aspect of education. Enjoy :)

In [79]:
df_malaysia_copy = df_malaysia_student_mental_health.copy()

# See the sum counts of Age and CGPA
grouped_age = df_malaysia_copy.groupby(['Age']).sum().reset_index()
grouped_cgpa = df_malaysia_copy.groupby(['CGPA']).sum().reset_index()
grouped_course = df_malaysia_copy.groupby(['Course']).sum().reset_index()

# Create a list of groupings and their corresponding dataframes
groupings = [('Age', grouped_age), ('CGPA', grouped_cgpa), ('Course', grouped_course)]

# Plotting
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(30, 8))

for i, (group_name, grouped_df) in enumerate(groupings):
    ax = axes[i]
    stacked_cols = ['Depression', 'Anxiety', 'Panic_attack']
    grouped_df.plot(ax=ax, x=group_name, y=stacked_cols, kind='bar', stacked=True)

    ax.set_xlabel(group_name)
    ax.set_ylabel('Count')
    ax.set_title(f'Counts of Depression, Anxiety, and Panic Attacks by {group_name}')
    ax.legend(title='Mental Health Condition')

# Adjust layout and display the plot
plt.grid(False)
plt.tight_layout()
plt.show()
[Figure: counts of depression, anxiety, and panic attacks by Age, CGPA, and Course]

The plot shows some very interesting results! Honestly, our intuition for CGPA was that students with low scores would get really stressed, resulting in depression. It turns out that students from International Islamic University Malaysia in 2020 with CGPAs in the [0 - 1.99], [2.00 - 2.49], and [2.50 - 2.99] ranges report fewer mental health issues than students in the [3.00 - 3.49] and [3.50 - 4.00] ranges. Being "successful" in college by this measure (high grades) may actually be unsuccessful in terms of mental health: students care so much that grades come to outweigh everything else; otherwise, they probably wouldn't be this depressed, anxious, and panicked.

Normally, a student goes to college at 18 and graduates at 22 (the typical 4-year college life). It looks like students aged 18 and 19 are suffering from many mental health problems. Is it because they are not "mature enough" to handle college life, such as living away from home or figuring out the course load? Students aged 20, 21, and 22 show a decrease in mental health problems; perhaps they have a clearer path to work toward, hence less stress. The intuitive ages for graduate students are 23 and 24, so we might expect depression and anxiety there too. However, without more data, we can't reliably test these assumptions.

Looking at the courses, bcs, bit, engineering, koe, and psychology stood out. Although we couldn't figure out what bcs, bit, and koe mean here, even after checking the university's website, it is unsurprising to see engineering among the most anxiety-inducing courses. Do grades really make a student successful, such as landing the salary the school promises? Of course, we all need money to survive, but does having the luxury to spend money and time in college imply student success? Remember, we only care about educational life here; outside school, situations vary a lot. As far as we can see, the way students might define "success" in education differs from the way we are starting to define it. Maybe learning to enjoy the moment is what makes us successful.

Drug Abuse Statistics dataset EDA¶

Drugs. No doubt they are a factor that can influence students negatively in education and lead them to fail.

In [80]:
plt.figure(figsize=(15, 4))
plt.barh(df_drug_abuse_reasons['Reason'], df_drug_abuse_reasons['Percent'])
plt.xlabel('Percent %')
plt.ylabel('Reasons')
plt.title('Top Reasons For Drug Abuse Among College Students')
plt.tight_layout()
plt.show()
[Figure: top reasons for drug abuse among college students]

It turns out that over 90 percent of students who do drugs do so because of their peers. This could indirectly show that these students haven't yet developed the habit of reasoning through consequences on their own; curiosity and the search for fun come naturally to inexperienced students. Living away from family could also drive a student toward drugs, and whether it expresses disagreement, rebellion, or escape, emotion dominates logic here. Media influence is also interesting to note. Perhaps educators failed to inform students about the harms of drugs. We hope you can find something interesting by extending this dataset in your own exploration.

Europe College Student Info dataset EDA¶

For this dataset, we will show a basic plot. For more visualizations of this dataset, see Step 4 below.

In [81]:
df_europe_copy = df_europe_students.copy()

# Counting the number of classes in 'target'
class_counts = df_europe_copy['Target'].value_counts()

# Plotting a bar plot
class_counts.plot(kind='bar', figsize=(15, 4), color='skyblue')

# Adding labels and title
plt.xlabel('Target Classes')
plt.ylabel('Count')
plt.title('Number of Instances in Each Target Class')
plt.grid(False)

# Display the plot
plt.tight_layout()
plt.show()
[Figure: counts of instances per Target class]

Interestingly, we have a dataset of "successful" students, defined by graduation. It is also interesting to see about 1500 dropouts versus about 2200 graduates. The dataset includes a Target (enrollment status) column and an Age at Enrollment column; let's see what happens if we plot them together.

In [82]:
plt.figure(figsize=(15, 5))
sns.violinplot(data=df_europe_students, x='Target', y='Age at enrollment', palette='pastel')

# Adding labels and title
plt.xlabel('Enrollment Status')
plt.ylabel('Age')
plt.title('Violin Plot of Age by Enrollment Status')

# Display the plot
plt.tight_layout()
plt.show()
[Figure: violin plot of age at enrollment by enrollment status]

From the age violin plot, we see that the median is about 20 years old. Currently enrolled students are mostly 17 - 20 years old, and graduates mostly enrolled between 19 and 24. People around 25 and older tend to drop out less, while younger people tend to drop out more. The data is, unsurprisingly, skewed toward younger students.

Step 4: Model: Analysis, Hypothesis Testing, & ML¶

The Europe student dataset is quite extensive, originally coming with 36 features. We trimmed off a few columns during Step 2 and will now build a variety of machine learning models to try to accurately classify an individual as a dropout, enrolled student, or graduate.

As a preliminary step, we will encode our three label categories (Dropout, Enrolled, Graduate) as 0, 1, and 2 for easier processing. This is necessary because some algorithms cannot operate on categorical or string-based labels.

In [83]:
df_europe_students['Target'] = df_europe_students['Target'].map({
    'Dropout': 0,
    'Enrolled': 1,
    'Graduate': 2,
})
In [84]:
df_europe_students.dtypes
Out[84]:
Marital status                           int64
Application mode                         int64
Application order                        int64
Course                                   int64
Daytime/evening attendance               int64
Previous qualification                   int64
Nacionality                              int64
Mother's qualification                   int64
Father's qualification                   int64
Mother's occupation                      int64
Father's occupation                      int64
Displaced                                int64
Educational special needs                int64
Debtor                                   int64
Tuition fees up to date                  int64
Gender                                   int64
Scholarship holder                       int64
Age at enrollment                        int64
International                            int64
Unemployment rate                      float64
Inflation rate                         float64
GDP                                    float64
Curricular units 1st sem (approved)      int64
Curricular units 1st sem (grade)       float64
Curricular units 2nd sem (approved)      int64
Curricular units 2nd sem (grade)       float64
Target                                   int64
dtype: object

Great! Here we can observe the uniform data type of each feature and can note that the target feature is represented as an integer. Onto the next step!

A correlation matrix is an extremely valuable tool in data analysis and the first thing we'll build. It is a square matrix holding the correlation coefficient between each pair of features in a dataset. From it, we can observe which features are positively correlated, negatively correlated, or uncorrelated, meaning how they tend to change relative to each other.

Let's generate a correlation matrix to get an insight into how our features behave together.

In [85]:
corr_matrix = df_europe_students.corr()

plt.figure(figsize=(10,10))
sns.heatmap(corr_matrix)
plt.title('Correlation Matrix')
Out[85]:
Text(0.5, 1.0, 'Correlation Matrix')
[Figure: correlation matrix heatmap of all features]

Alright, here's the correlation matrix we wanted! The visualization itself is called a heatmap, with the "temperature" indicating the strength of correlation between features. This is a great first insight into the inner workings of the dataframe: most features are either uncorrelated or negatively correlated with each other. In a real-world setting, this is expected and a practical result.

We've just generated a n*n correlation matrix, but what might help us more specifically is to hone in on the relationship between our features and target label specifically. Let's create a new visualization to model this.

In [86]:
target_corr = df_europe_students.corrwith(df_europe_students['Target'])
plt.figure(figsize=(20, 5))
sns.heatmap(target_corr.to_frame(), annot=True, cmap='coolwarm', center=0, linewidths=0.5)
plt.title('Correlation of Target vs Features')
plt.show()
[Figure: heatmap of each feature's correlation with Target]

Now, we have another heatmap generated that is n*1 and models the relationship between each feature and our target. Based on the results of the heatmap, we can extract the k highest correlated features to our target label.

An important distinction to first make is that while correlation does not imply causation, we can still reason as to why we believe these trends to be occurring with that caution in mind.

In [87]:
# Set the value of k (number of highest correlations to extract)
k = 10

target_corr = corr_matrix['Target']
target_corr = target_corr.sort_values(ascending=False)[1:(k + 1)]

print("Top", k, "highest correlations:")
print(target_corr)
Top 10 highest correlations:
Curricular units 2nd sem (approved)    0.624157
Curricular units 2nd sem (grade)       0.566827
Curricular units 1st sem (approved)    0.529123
Curricular units 1st sem (grade)       0.485207
Tuition fees up to date                0.409827
Scholarship holder                     0.297595
Displaced                              0.113986
Application order                      0.089791
Daytime/evening attendance             0.075107
GDP                                    0.044135
Name: Target, dtype: float64

The output above gives us some insight into the features most strongly correlated with our target label. One might infer that academic performance and grades have a strong influence on student retention. We're also able to see other trends, like areas of finance (i.e. tuition, scholarship, living conditions) having some form of association.

Naturally, we can reason as to why this is. If an individual is unable to perform at the needed academic level and keep up with course rigor, they're more inclined to leave schooling. Similarly, the cost of higher education has seen a significant increasing trend across several years, making aid such as scholarships more and more in demand. If students are unable to manage and keep up with their finances, they have no choice but to forgo their studies.

Making use of the Scikit-learn library and its pre-implemented models, we will run various algorithms on the Europe student dataset: KNN (K-nearest neighbors), SVM (support vector machines), logistic regression, decision trees, and random forests.

To begin building said models, we need to split our dataset into training and testing subsets. The X training and testing data are feature vectors without their corresponding label. The y training and testing data are said corresponding labels for each of those observations.

There are trade-offs to the size of our training/testing split but we will opt to use the conventionally standard 80/20 metric where 80% is used to train and 20% is used to test.

In [88]:
from sklearn.model_selection import train_test_split

X = df_europe_students.drop('Target', axis=1)
y = df_europe_students['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
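As an aside, because our three classes are imbalanced, one common refinement (which we do not apply above) is a stratified, seeded split. A minimal sketch on synthetic data, with `X_demo`/`y_demo` standing in for our real features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and imbalanced labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.choice([0, 1, 2], size=1000, p=[0.35, 0.15, 0.50])

# stratify=y keeps class proportions equal across train and test;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

print(len(X_tr), len(X_te))  # 800 200
```

Stratification matters most for the minority class ("Enrolled" in our case), since a plain random split can leave it under-represented in the test set.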

KNN (K-nearest neighbors)¶

First, we will try K-nearest neighbors. This algorithm works by reasoning that data points with similar features tend to belong to the same classification. It finds the "k" nearest data points to an input with a given distance metric (i.e., Euclidean distance) and then takes the majority of the neighbors' labels to classify the input.

In [89]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn_model = KNeighborsClassifier(n_neighbors=5)

knn_model.fit(X_train, y_train)

y_pred = knn_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"K-Nearest Neighbor Accuracy: {accuracy}")
K-Nearest Neighbor Accuracy: 0.6225988700564972

Here, we can observe that KNN's accuracy isn't as high as we'd like. This could be due to a number of reasons, such as the choice of k. Choosing k is rather nuanced, and here we're using an arbitrary value, five.

One thing to highlight is that our KNN model may be suffering from a phenomenon known as the "curse of dimensionality." KNN's performance can degrade as the number of features (dimensions) in the dataset increases. In high-dimensional spaces, data points tend to be more spread out, making it difficult to find meaningful nearest neighbors.

This aligns with our dataset which originally started with 36 features which we trimmed down during data processing. If we kept all 36 original features, we may have observed even worse performance from the model.
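One way to address both issues, sketched here on synthetic data rather than our actual dataframe, is to standardize the features (so large-valued columns like course codes don't dominate the distance metric) and sweep several values of k with cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the student feature matrix (three classes)
X_demo, y_demo = make_classification(n_samples=600, n_features=26,
                                     n_informative=8, n_classes=3,
                                     random_state=0)

# Scale first so no single feature dominates the distance metric,
# then try several values of k instead of fixing k=5
for k in (3, 5, 11, 21):
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    score = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()
    print(f"k={k}: mean CV accuracy = {score:.3f}")
```

On the real dataset you would swap `X_demo`/`y_demo` for `X_train`/`y_train` and keep the k with the best cross-validated score.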

Logistic Regression¶

Next, we will try to use logistic regression. Conventionally, logistic regression is a binary classifier algorithm. However, there is such a thing as multiclass logistic regression, also known as softmax regression, which Scikit-learn defaults to when necessary. Essentially, it predicts the probability of each class for some input and uses the softmax function to normalize the probabilities so they all add up to 1. The class with the highest probability gets predicted as the final output!
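To make the softmax idea concrete, here is a tiny numeric sketch; the raw scores are made up, and the class ordering in the comment is only illustrative:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical raw scores for (Dropout, Enrolled, Graduate)
scores = np.array([1.2, 0.3, 2.5])
probs = softmax(scores)

print(probs.round(3), probs.sum())  # probabilities sum to 1
```

The highest raw score ("Graduate" here) gets the highest probability, and that class is returned as the prediction.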

In [90]:
from sklearn.linear_model import LogisticRegression

logreg_model = LogisticRegression(solver='liblinear')

logreg_model.fit(X_train, y_train)

y_pred = logreg_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Accuracy: {accuracy}")
Logistic Regression Accuracy: 0.7615819209039548

We see a significant improvement with our logistic regression model! One thing to note is that logistic regression tends to perform well when there is a well-balanced representation of classes amongst our examples in the dataset. As observed in Step 2, there is a rather uneven split of graduates, dropouts, and lastly enrolled students when it comes to the frequency of their class representation. If we could better balance this, perhaps if the study surveyed more people, we could see even better results.
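One standard mitigation worth knowing (we don't apply it above) is scikit-learn's `class_weight='balanced'` option, which reweights each class by its inverse frequency. A sketch on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced three-class problem, echoing the uneven
# Graduate/Dropout/Enrolled split seen in Step 3
X_demo, y_demo = make_classification(n_samples=900, n_features=10,
                                     n_informative=6, n_classes=3,
                                     weights=[0.5, 0.35, 0.15],
                                     random_state=0)

# class_weight='balanced' reweights each class by its inverse frequency,
# so the minority class isn't drowned out during training
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_demo, y_demo)
print(clf.score(X_demo, y_demo))
```

Reweighting trades a little overall accuracy for better recall on the minority class, which may matter if the rare class (here, currently enrolled students) is the one we care about.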

SVM (Support Vector Machines)¶

Another model we can try is SVM (Support Vector Machines). In particular, we will use multiclass SVM to match our three classes (dropout, enrolled, graduate). Standard SVM fits a hyperplane, or in layman's terms a divider, to apply a binary classification to the dataset. In a two-dimensional space, this can be thought of as a dividing line where you either are or are not something.

Multiclass SVM extends this logic, fitting a hyperplane for each class as needed. Let's see how it performs!

In [91]:
from sklearn.svm import SVC

def run_svm(svm_model):
    svm_model.fit(X_train, y_train)

    y_pred = svm_model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)

    print(f"SVM Accuracy with {svm_model.kernel}: {accuracy}")

svm_model = SVC(C=1.0)
run_svm(svm_model)
svm_model.kernel = 'linear'
run_svm(svm_model)
SVM Accuracy with rbf: 0.5028248587570622
SVM Accuracy with linear: 0.7570621468926554

Wow, we're getting two very different results here! But first... what are rbf and linear?

These are what we call kernel functions, and they are an important part of how SVM models operate. When a hyperplane can't cleanly separate a dataset, we can apply a kernel function: a mathematical function that maps data points from their original feature space into a higher-dimensional one. For example, if our data points are jumbled in a 2-dimensional feature space, we may not be able to draw a line that perfectly separates the classes, but a kernel function can lift those points into a three-dimensional space where a separating plane exists. In other words, kernel functions give us an extra layer of flexibility when a dataset proves difficult to work with!

By using a linear kernel function, we're able to observe a much better result than the default rbf kernel!
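To make the kernel comparison above more concrete, here's a minimal sketch that scores a few kernels with cross-validation. It uses a synthetic three-class dataset from `make_classification` as a stand-in for our student data, so the exact numbers are illustrative only:

```python
# Compare SVM kernels via 5-fold cross-validation on a synthetic
# 3-class dataset (a stand-in for the real student data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=6,
    n_classes=3, random_state=0,
)

for kernel in ["linear", "rbf", "poly"]:
    model = SVC(kernel=kernel, C=1.0)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{kernel}: mean accuracy = {scores.mean():.3f}")
```

Cross-validation is a safer way to compare kernels than a single train/test split, since each kernel is scored on five different folds of the data.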

Decision Trees¶

Another fun model to try is what's called a decision tree. We can almost think of this as a flowchart where the internal nodes represent a decision our model makes based on a specific feature. We'll see this flowchart in a little bit! This leads to branches that represent outcomes or further decisions. At the end of the branches, you find the predicted outcome or classification based on the path taken through the tree.

In [92]:
dt_model = DecisionTreeClassifier(criterion="entropy")

dt_model.fit(X_train,y_train)

y_pred = dt_model.predict(X_test)

print(f"Accuracy Score: {accuracy_score(y_test, y_pred) * 100:.2f}%")
Accuracy Score: 67.80%

Yikes! This isn't looking too good. Something to mention is that because of the 'flowchart' nature of the decision tree algorithm, there's a very important hyperparameter we control called the max depth. Essentially, it caps how tall our tree can grow. This matters especially for large datasets like ours, where an abundance of features and observations lets an unconstrained tree overfit the training data.

Not to mention, if we let the algorithm decide on a depth of its own, we get a really, really confusing tree. Let's look at that for a minute.

In [93]:
feature_cols = df_europe_students.drop('Target', axis=1).columns

dot_data = export_graphviz(dt_model, out_file=None, feature_names=feature_cols, class_names=['Dropout', 'Enrolled', 'Graduate'], filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.680293 to fit

Out[93]:
(Visualization of the full, unconstrained decision tree; far too large to read)

Not too pretty right? Let's pick a max depth that makes this easier on the eyes.

In [94]:
dt_model = DecisionTreeClassifier(criterion="entropy", max_depth=3)

dt_model.fit(X_train,y_train)

y_pred = dt_model.predict(X_test)

print(f"Accuracy Score: {accuracy_score(y_test, y_pred) * 100:.2f}%")
Accuracy Score: 74.69%
In [95]:
feature_cols = df_europe_students.drop('Target', axis=1).columns

dot_data = export_graphviz(dt_model, out_file=None, feature_names=feature_cols, class_names=['Dropout', 'Enrolled', 'Graduate'], filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
Out[95]:
(Visualization of the depth-3 decision tree)

By constraining the depth that we allow our decision tree to grow to, we see improved results. A shallow tree makes its classifications using only the most informative features, which are chosen for the splits near the root (with criterion="entropy", each split is picked to maximize information gain). As the tree grows deeper and begins to split on less informative features, it starts memorizing the training data and becomes more prone to errors on the test set!
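The depth trade-off described above is easy to see by sweeping max_depth and comparing test accuracy. The sketch below uses a synthetic dataset (not our student data), so treat the pattern, not the exact numbers, as the takeaway:

```python
# Sweep max_depth on a synthetic 3-class dataset to illustrate the
# underfitting/overfitting trade-off for decision trees.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for depth in [2, 3, 5, 10, None]:
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth,
                                  random_state=0)
    tree.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, tree.predict(X_te))
    print(f"max_depth={depth}: test accuracy = {acc:.3f}")
```

Typically accuracy rises as the tree gains enough depth to capture the signal, then plateaus or falls once extra depth only fits noise.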

Random Forest¶

The last model we'll try is called Random Forest, and it actually builds on decision trees! Essentially, we build n decision trees to make up a forest. Each tree in the forest is trained on a different bootstrap sample of the training data (and considers a random subset of features at each split). The final prediction from the random forest is then a combination of the predictions made by each individual tree, which generally leads to a more accurate model.

In [96]:
rf_model = RandomForestClassifier(n_estimators=100)

rf_model.fit(X_train, y_train)

y_pred = rf_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print(f"Random Forest Accuracy: {accuracy}")
Random Forest Accuracy: 0.7864406779661017
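A nice bonus of random forests is that they report how much each feature contributed to the splits across the forest, which connects nicely to the interpretation step below. Here's a minimal sketch on a synthetic dataset with placeholder feature names (the real data has named columns like grades and tuition status):

```python
# Inspect feature importances from a random forest on synthetic data.
# Feature names are placeholders for illustration.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Importances sum to 1; higher means the feature drove more splits
importances = pd.Series(rf.feature_importances_,
                        index=[f"feature_{i}" for i in range(X.shape[1])])
print(importances.sort_values(ascending=False))
```

On the real student data, sorting `rf_model.feature_importances_` this way gives a quick sanity check on which factors the model leaned on most.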

Step 5: Interpretation: Insight & Policy Decision¶

Through our exploration of seven different datasets, we conducted a thorough analysis of the different factors that play into student retention. From health to finance to working conditions, we were able to take away a few points.

Here are a few:

  • Our analysis of post-college salaries in the US indicates that there is a significant payoff in pursuing a higher education.
  • Based on our tracking of child labor in a few countries, we wonder whether education is valued equally across different areas of the world. In some places schooling may lose out to necessity, and we believe the presence of children working between the ages of 7 and 14 reflects this.
  • Taking into consideration population influx, our research into college enrollment by ethnicity revealed a drastic decrease in attendance by Native Americans and an overarching decrease in college attendance across all ethnicities.
  • While looking into the mental wellbeing of Malaysian students, we observed a clear trend of mental health issues among high-performing students. This leads us to infer that persisting mental health issues, as a byproduct of academic rigor, may be a deterrent to continuing one's education.
  • There is a large amount of drug use in college for predominantly social reasons and self-fulfillment. We believe this may also tie into our findings regarding poor mental health and high academic performance. This inference needs to be expanded on with more research in the future for concrete evidence.
  • From our exploration of the Europe student dataset, we found that student grades, tuition, reception of scholarships, and displacement were most correlated with our classification of dropout, enrolled, or graduate. While we can't assert that correlation implies causation, this further supports our hypothesis that one's academic performance and financial flexibility play key roles in student retention.

In conclusion, our findings on ethnicity and finance reveal that there may be generational factors at play for certain groups of people which affect one's ability to attend school. Further, there are constant pressures of maintaining mental health while performing well in school, which may lead to drug use and other forms of escapism to get by. If we were to suggest policy implementations from these findings, we would note that these issues are recurring trends that may have accumulated throughout history. In order to improve the desire for education and the retention of those in schooling, we would suggest continued research into student wellbeing, generational obstacles, and healthier recreation for college students.